Ran into problems with the code in the previous notebook due to the complicated way of mapping from the list of DIP entry proteins to Uniprot identifiers and then to Entrez identifiers. Simplifying this process here to avoid erroneous entries.
The code from the previous notebook worked fine for the mapping itself, but failed to preserve information about which interactions existed. Making a few small changes to ensure that works.
In [8]:
cd /home/gavin/Documents/MRes/DIP/human/
In [9]:
ls
In [10]:
import csv
c = csv.reader(open("Hsapi20140427.txt"), delimiter="\t")
#skip the header line
c.next()
#keep only the first two fields (the interactor identifier strings) of every line
IDstrings = map(lambda x: (x[0], x[1]), c)
In [11]:
# go through this and keep only the DIP identifier from each interactor
# field, discarding the uniprot and refseq identities packed into the
# same string; the quickest way to write this is a list comprehension
IDstrings = [(x.split("|")[0], y.split("|")[0]) for (x, y) in IDstrings
             if "DIP" in x.split("|")[0] and "DIP" in y.split("|")[0]]
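The flatten function used in the next cell isn't defined anywhere in this notebook; it presumably came in via the pylab environment (matplotlib's cbook module provides one). A minimal sketch of an equivalent generator, in case it needs defining explicitly:
#minimal stand-in for the flatten assumed to come from %pylab /
#matplotlib.cbook in the original environment
def flatten(nested):
    for item in nested:
        if isinstance(item, (list, tuple)):
            for sub in flatten(item):
                yield sub
        else:
            yield item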
In [12]:
from collections import OrderedDict
#flatten and remove duplicate entries, but keep the original structure
flatIDstrings = list(OrderedDict.fromkeys(flatten(IDstrings)))
print "Number of proteins in DIP human dataset is %i"%len(flatIDstrings)
Now there are two variables in use:
IDstrings
- A list of pairs of interacting proteins in the DIP format.
flatIDstrings
- A list of all proteins in the above list, with redundancies removed.
The next step is to convert both of these lists to Uniprot and Entrez identifiers. However, this is not a one-to-one transformation: some identifiers may map to two identifiers, or to none. This has to be taken into account to avoid creating interactions between protein identifiers that don't exist; a small example of the expansion is sketched below.
The mapping can be built using the Uniprot ID mapping service with the flat file as input. The resulting file can then be loaded back in and turned into a Python dictionary:
In [13]:
#writing the flattened protein IDs to file
c = csv.writer(open("flat.DIP.txt", "w"), delimiter="\n")
c.writerow(flatIDstrings)
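Using a newline as the delimiter is a quick way to get one identifier per line, which the Uniprot ID mapping service accepts as batch input.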
The Uniprot ID mapping service returned:
3,509 out of 3,844 identifiers mapped to 3,467 identifiers in the target data set
Reading in this file:
In [14]:
c = csv.reader(open("DIPtouniprot.tab"), delimiter="\t")
# skip first row
c.next()
cl = list(c)
# build dictionary
DIPtouni = {}
for l in cl:
    #place an empty list in for every DIP protein mapped
    DIPtouni[l[0]] = []
for l in cl:
    #fill the lists with uniprot identifiers
    DIPtouni[l[0]].append(l[1])
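As an aside, the same dictionary could be built in a single pass with collections.defaultdict; a sketch:
from collections import defaultdict
DIPtouni = defaultdict(list)
for l in cl:
    DIPtouni[l[0]].append(l[1])
#convert back to a plain dict so missing DIP IDs still raise KeyError
#in the mapping loop below (a defaultdict would silently create them)
DIPtouni = dict(DIPtouni)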
Arbitrary DIP identifiers can now be converted to Uniprot:
In [15]:
print DIPtouni[flatIDstrings[8]]
Converting the entire set of DIP interaction pairs to Uniprot, while keeping track of any entries that fail to map:
In [16]:
import itertools
uniIDstrings = []
faildict = {}
for k, l in enumerate(IDstrings):
    # try to retrieve entries for these proteins
    try:
        unip = map(lambda x: DIPtouni[x], l)
        if len(unip[0]) >= 1 or len(unip[1]) >= 1:
            #iterate over all combinations of the two mapping lists
            for i in itertools.product(unip[0], unip[1]):
                uniIDstrings.append(i)
    except KeyError:
        faildict[k] = l
print "Failed to map %i of %i entries."%(len(faildict.values()), len(IDstrings))
The flattened list of protein identifiers must also be mapped to Uniprot, but these are simply the values contained in the dictionary used to convert from DIP to Uniprot identifiers. Writing these to a file, they can then be used to convert to Entrez identifiers:
In [17]:
c = csv.writer(open("flat.uniprot.txt", "w"), delimiter="\n")
flatuniprot = list(flatten(DIPtouni.values()))
c.writerow(flatuniprot)
Using this on the Uniprot mapping service returns:
3,325 out of 3,511 identifiers mapped to 3,437 identifiers in the target data set
The resulting table can then be loaded in as a dictionary again and used to convert the Uniprot list of interacting pairs:
In [18]:
c = csv.reader(open("uniprottoEntrez.tab"), delimiter="\t")
# skip first row
c.next()
cl = list(c)
# build dictionary
unitoEnt = {}
for l in cl:
    #place an empty list in for every Uniprot protein mapped
    unitoEnt[l[0]] = []
for l in cl:
    #fill the lists with Entrez identifiers
    unitoEnt[l[0]].append(l[1])
In [19]:
import itertools
EntIDstrings = []
faildictEnt = {}
for k, l in enumerate(uniIDstrings):
    # try to retrieve entries for these proteins
    try:
        unip = map(lambda x: unitoEnt[x], l)
        if len(unip[0]) >= 1 or len(unip[1]) >= 1:
            #iterate over all combinations of the two mapping lists
            for i in itertools.product(unip[0], unip[1]):
                EntIDstrings.append(i)
    except KeyError:
        faildictEnt[k] = l
print "Failed to map %i of %i entries."%(len(faildictEnt.values()), len(uniIDstrings))
This produces the list of interacting pairs in Entrez format. Some example entries:
In [20]:
print EntIDstrings[0:10]
There are several files we wish to make from this processed data. The naming scheme for these files (matching the code below) is given here:
flat.DIP.txt
flat.uniprot.txt
flat.Entrez.txt
interacting.DIP.txt
interacting.uniprot.txt
interacting.Entrez.txt
It's also important that entries which failed to map are removed from the previous identifier's list, so that all the lists stay compatible. This can be achieved easily using the faildict variables defined above:
In [21]:
#use indexes in faildict to zero entries we're not interested in
for k in faildict.keys():
    IDstrings[k] = 0
#then use ifilter to remove these entries
IDstrings = list(itertools.ifilter(lambda x: x != 0, IDstrings))
In [22]:
#then do the same for uniIDstrings
for k in faildictEnt.keys():
    uniIDstrings[k] = 0
uniIDstrings = list(itertools.ifilter(lambda x: x != 0, uniIDstrings))
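Equivalently, the zero-and-filter steps in the last two cells could each be written as a single list comprehension over the indices, for example:
#keep only entries whose index is not in the failure dictionary
IDstrings = [x for i, x in enumerate(IDstrings) if i not in faildict]
uniIDstrings = [x for i, x in enumerate(uniIDstrings) if i not in faildictEnt]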
In [23]:
#redefine the flattened list of uniprot IDs
flatuniprot = list(unitoEnt.keys())
#keep only those DIP IDs whose Uniprot mappings all survive into the Entrez mapping
flatIDstrings = [x for x in DIPtouni.keys() if all([a in flatuniprot for a in DIPtouni[x]])]
#regenerate flat files:
csv.writer(open("flat.DIP.txt", "w"), delimiter="\n").writerow(flatIDstrings)
csv.writer(open("flat.uniprot.txt", "w"), delimiter="\n").writerow(flatuniprot)
In [24]:
#then save the flattened Entrez IDs
flatEnt = list(flatten(unitoEnt.values()))
csv.writer(open("flat.Entrez.txt", "w"), delimiter="\n").writerow(flatEnt)
In [25]:
#then save all of the interaction tables
csv.writer(open("interacting.DIP.txt", "w"), delimiter="\t").writerows(IDstrings)
csv.writer(open("interacting.uniprot.txt", "w"), delimiter="\t").writerows(uniIDstrings)
csv.writer(open("interacting.Entrez.txt", "w"), delimiter="\t").writerows(EntIDstrings)
At this point it seems worthwhile to check whether there are bait or prey proteins in our experimental results which are represented in this list of interacting proteins. As this is our ground truth, which will be used to train our prediction algorithm, any interactions present in the DIP dataset must be treated as real. In other words, our prediction algorithm can't outperform the training set.
Loading in the bait and prey lists and combining them:
In [26]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS/
In [27]:
ls
In [28]:
%%bash
head prey_entrez_ids.csv
In [29]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))
In [30]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/
In [31]:
ls
In [32]:
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))
In [33]:
#check the crossover
crossover = [x for x in baitids+preyids if x in flatEnt]
In [34]:
print "%i of %i Entrez IDs in bait and prey lists are found in DIP dataset"%(len(crossover),len(baitids+preyids))
So about a quarter of the proteins we're interested in are found in the DIP dataset. However, this doesn't mean that a quarter of the interactions we're trying to find are already solved: there could be interactions between these crossover proteins that are not accounted for in the DIP dataset. A quick check of this is sketched below.
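To put a number on that, the DIP interaction list can be intersected with the crossover proteins directly. A sketch (the experimentids set is introduced here for illustration):
#count DIP interactions where both partners appear in the bait+prey lists
experimentids = set(baitids + preyids)
known = [(a, b) for (a, b) in EntIDstrings
         if a in experimentids and b in experimentids]
print "%i DIP interactions are between proteins in our experiment" % len(known)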
Colin asked what the crossover is for the bait list alone, and it's pretty easy to calculate:
In [35]:
baitcrossover = [x for x in baitids if x in flatEnt]
print "%i of %i Entrez IDs in bait list are found in DIP dataset"%(len(baitcrossover),len(baitids))